class: center, middle, inverse, title-slide

# Introduction to R for Data Analysis
## Data Wrangling Advanced
### Johannes Breuer & Stefan Jünger
### 2021-08-03

---
layout: true

---

## Data wrangling continued 🤠

While in the last sessions we focused on the bread-and-butter tasks of the data preparation business, in this part we will focus on the more 'programmy' side of things.

- altering the content of a whole set of variables
- conditional variable transformation
- formulating logical requests to our data
- writing loops

---
class: middle

**We will switch between the world of `base R` and the `tidyverse`, as it is also a good lesson that it is not necessary to rely exclusively on either one of them. Yet, we cannot be comprehensive for either, which is why we will show routines that may be important for your first steps after the course and for the rest of the week.**

---

## Load the data

Again, we will work with the *Public Use File (PUF) of the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* as `.csv` file.

```r
library(tidyverse) # provides read_csv2() (readr), the pipe, and the dplyr verbs used below

gp_covid <- read_csv2("./data/ZA5667_v1-1-0.csv")
```

---

## Quickly define missing values

```r
library(sjlabelled)

gp_covid <-
  gp_covid %>% 
  set_na(na = c(-99, -77, -33, 98))
```

---

## Variables of interest

Say, we are interested in the (dis)trust towards several authorities during the Corona crisis. There are 9 items on this topic. Let's create some quick tables for 3 of them.

```r
table(gp_covid$hzcy044a)
```

```
## 
##    1    2    3    4    5 
##   43  174  329 1250 1269
```

```r
table(gp_covid$hzcy047a)
```

```
## 
##    1    2    3    4    5 
##   29   61  188 1054 1763
```

```r
table(gp_covid$hzcy052a)
```

```
## 
##    1    2    3    4    5 
##   25   79  303 1422 1278
```

What if we want to conduct some data reduction method (e.g., PCA) and need the variables in reverse order for interpretation purposes?

---

## Recode data **across** defined variables

The `dplyr` package provides a (new) handy tool to do exactly this: `across()`.
This function can be used to apply another function to multiple variables at once.

```r
gp_covid <-
  gp_covid %>% 
  mutate(
    across(
      hzcy044a:hzcy052a,
      ~recode(
        .x,
        `5` = 1, # `old value` = new value
        `4` = 2,
        `2` = 4,
        `1` = 5
      )
    )
  )
```

---
class: middle

```r
table(gp_covid$hzcy044a)
```

```
## 
##    1    2    3    4    5 
## 1269 1250  329  174   43
```

```r
table(gp_covid$hzcy047a)
```

```
## 
##    1    2    3    4    5 
## 1763 1054  188   61   29
```

```r
table(gp_covid$hzcy052a)
```

```
## 
##    1    2    3    4    5 
## 1278 1422  303   79   25
```

---

## Using `across()` across logical conditions

Sometimes we are interested in variables that meet certain conditions. For example, for an analysis, we may want to z-standardize all numeric variables in a dataset. Let's create a temporary subset of our data to exemplify such efforts.

```r
gp_covid_tmp <-
  gp_covid %>% 
  select(doi, hzcy044a:hzcy052a)

gp_covid_tmp %>% 
  sample_n(5) # randomly sample 5 cases from the df
```

```
## # A tibble: 5 x 10
##   doi             hzcy044a hzcy045a hzcy046a hzcy047a hzcy048a hzcy049a hzcy050a hzcy051a hzcy052a
##   <chr>              <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 10.4232/1.13520        1        1        1        1        1        1        1        1        1
## 2 10.4232/1.13520        3        2        3        2        2        2        2        1        1
## 3 10.4232/1.13520        3        3        3        3        3        3        3        3        2
## 4 10.4232/1.13520        3        3        3        1        2        2        2        1        1
## 5 10.4232/1.13520        1        1        1        1        2        2        2        1        1
```

---

## Example: z-standardize all numeric variables

The `base R` function to z-standardize a variable is `scale()`.
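
As a small sketch of what `scale()` does, assuming a made-up toy vector `x` (invented for illustration, not part of the survey data): it subtracts the mean and divides by the standard deviation. It also returns a one-column matrix rather than a plain vector, which is presumably why the column names in the output on this slide carry a `[,1]` suffix.

```r
# Toy vector, invented for illustration only
x <- c(2, 4, 6, 8, 10)

# scale() centers on the mean and divides by the standard deviation
z <- scale(x)

# The same computation by hand
z_manual <- (x - mean(x)) / sd(x)

# Compare both versions; as.numeric() drops the matrix structure of z
all.equal(as.numeric(z), z_manual)

# Check whether scale() returned a matrix
is.matrix(z)
```

Applied to all numeric columns of the temporary subset, this looks as follows:
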
```r
gp_covid_tmp <-
  gp_covid_tmp %>% 
  mutate(
    across(
      where(is.numeric), # select all numeric columns
      ~scale(.x)
    )
  )

gp_covid_tmp %>% 
  sample_n(5)
```

```
## # A tibble: 5 x 10
##   doi             hzcy044a[,1] hzcy045a[,1] hzcy046a[,1] hzcy047a[,1] hzcy048a[,1] hzcy049a[,1] hzcy050a[,1] hzcy051a[,1] hzcy052a[,1]
##   <chr>                  <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
## 1 10.4232/1.13520        0.164           NA           NA        0.570        0.655        0.499        0.824      -0.0341         1.57
## 2 10.4232/1.13520         2.33         2.02         1.51         3.15         1.64         1.37         1.83      -0.0341        0.302
## 3 10.4232/1.13520        0.164           NA           NA       -0.722       -0.332       -0.372       -0.179        -1.09       -0.962
## 4 10.4232/1.13520       -0.920       -0.214        0.462       -0.722       -0.332       -0.372       -0.179      -0.0341        0.302
## 5 10.4232/1.13520       -0.920        -1.33        -1.63           NA        -1.32        -1.24        -1.18           NA           NA
```

---

## `dplyr::across()`

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\across_blank.png" width="95%" style="display: block; margin: auto;" />

<small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small>

---

## Aggregate variables across rows

Something we might want to do for our analyses is to create aggregate variables, such as sum or mean scores for a set of items. As `dplyr` operations are applied to columns, whereas such aggregations relate to rows (i.e., respondents), we need to make use of the function `rowwise()`. Say, for example, we want to compute a sum score for all items that ask how much respondents trust specific people or institutions in dealing with the Corona virus.

```r
gp_covid <-
  gp_covid %>% 
  rowwise() %>% #<<
  mutate(
    sum_trust = 
      sum(
        c_across(hzcy044a:hzcy052a),
        na.rm = TRUE
      )
  ) %>% 
  ungroup()
```

---

## Aggregate variables

Three things to note here:

1. `c_across()` is a special version of `across()` for rowwise operations.

2. We use the `ungroup()` function at the end to ensure that `dplyr` verbs will operate the default way when we further work with the `gp_covid` object.
We do not cover grouping in this course (which is especially valuable for summarizing data), but you can check out the [documentation for `group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) to learn more about this.

3. If you only need sums or means, a somewhat faster alternative is using the base `R` functions `rowSums()` and `rowMeans()` in combination with `mutate()` (and possibly also `across()` plus selection helpers). For an explanation why this can be faster, you can read the [online documentation for `rowwise()`](https://dplyr.tidyverse.org/articles/rowwise.html).

---

## Aggregate variables

```r
gp_covid %>% 
  select(hzcy044a:hzcy052a, sum_trust) %>% 
  glimpse()
```

```
## Rows: 3,765
## Columns: 10
## $ hzcy044a  <dbl> NA, 1, 2, 2, NA, 2, 2, 2, NA, 1, 3, 1, NA, 2, 3, NA, NA, 1, NA, 2, NA, 1, 1, NA, 2, 1, 2, 1, 2, NA, NA, 1, NA, 2, 2, 2, 1, 3~
## $ hzcy045a  <dbl> NA, 2, 2, 2, NA, 1, 2, 2, NA, 2, 2, 3, NA, 3, 3, NA, NA, 1, NA, 4, NA, 2, 1, NA, 2, NA, NA, 3, 2, NA, NA, 1, NA, 2, NA, 2, N~
## $ hzcy046a  <dbl> NA, 2, 1, 2, NA, 2, 2, 2, NA, 3, 4, 4, NA, 3, 3, NA, NA, 1, NA, 4, NA, 3, 2, NA, 2, 2, 3, 3, 2, NA, NA, 3, NA, 2, 4, 3, 3, 3~
## $ hzcy047a  <dbl> NA, 1, 1, 2, NA, 1, 2, 1, NA, 2, 2, 1, NA, 2, 2, NA, NA, 2, NA, 1, NA, 1, 1, 2, 2, 2, 2, 2, 1, NA, NA, 3, NA, 2, 2, 1, 1, 3,~
## $ hzcy048a  <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 3, 4, 5, NA, 2, 4, NA, NA, 2, NA, 1, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 4, NA, 3, 3, 2, 1, 3,~
## $ hzcy049a  <dbl> NA, 2, 3, 2, NA, 4, 3, 2, NA, 4, 5, 4, NA, 2, 4, NA, NA, 2, NA, NA, NA, 3, 2, 2, 2, 2, 2, 2, 3, NA, NA, 4, NA, 5, 3, 2, 1, 3~
## $ hzcy050a  <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 4, 4, 5, NA, 2, 2, NA, NA, 1, NA, 2, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, NA, 2, 3, 1, 2, 3,~
## $ hzcy051a  <dbl> NA, 2, 4, 2, NA, 3, 4, 1, NA, 1, 2, 3, NA, 2, 1, NA, NA, 1, NA, 3, NA, 4, 1, 2, 2, 2, 2, 3, 2, NA, NA, 3, NA, 2, 3, 1, 1, 3,~
## $ hzcy052a  <dbl> NA, 2, 1, 2, NA, 1, 2, 1, NA, 1, 2, 2, NA, 1, 1, NA, NA, 1, NA, 1, NA, 2, 2, 2, 2, 2, 3, 3, 1, NA, NA, 4, NA, 2, 2, 2, 2, 2,~
## $ sum_trust <dbl> 0, 16, 18, 18, 0, 18, 23, 15, 0, 21, 28, 28, 0, 19, 23, 0, 0, 12, 0, 18, 0, 22, 14, 12, 18, 15, 18, 21, 17, 0, 0, 25, 0, 22,~
```

---

## Example: Aggregate variables based on means

Rowwise transformations work the same way for means. Here, we create a mean score for the items that ask how much people trust specific people or institutions in dealing with the Corona virus.

```r
gp_covid <-
  gp_covid %>% 
  rowwise() %>%
  mutate(
    mean_trust = 
      mean(
        c_across(hzcy044a:hzcy052a),
        na.rm = TRUE
      )
  ) %>% 
  ungroup()
```

---
class: middle

```r
gp_covid %>% 
  select(hzcy044a:hzcy052a, mean_trust) %>% 
  glimpse()
```

```
## Rows: 3,765
## Columns: 10
## $ hzcy044a   <dbl> NA, 1, 2, 2, NA, 2, 2, 2, NA, 1, 3, 1, NA, 2, 3, NA, NA, 1, NA, 2, NA, 1, 1, NA, 2, 1, 2, 1, 2, NA, NA, 1, NA, 2, 2, 2, 1, ~
## $ hzcy045a   <dbl> NA, 2, 2, 2, NA, 1, 2, 2, NA, 2, 2, 3, NA, 3, 3, NA, NA, 1, NA, 4, NA, 2, 1, NA, 2, NA, NA, 3, 2, NA, NA, 1, NA, 2, NA, 2, ~
## $ hzcy046a   <dbl> NA, 2, 1, 2, NA, 2, 2, 2, NA, 3, 4, 4, NA, 3, 3, NA, NA, 1, NA, 4, NA, 3, 2, NA, 2, 2, 3, 3, 2, NA, NA, 3, NA, 2, 4, 3, 3, ~
## $ hzcy047a   <dbl> NA, 1, 1, 2, NA, 1, 2, 1, NA, 2, 2, 1, NA, 2, 2, NA, NA, 2, NA, 1, NA, 1, 1, 2, 2, 2, 2, 2, 1, NA, NA, 3, NA, 2, 2, 1, 1, 3~
## $ hzcy048a   <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 3, 4, 5, NA, 2, 4, NA, NA, 2, NA, 1, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 4, NA, 3, 3, 2, 1, 3~
## $ hzcy049a   <dbl> NA, 2, 3, 2, NA, 4, 3, 2, NA, 4, 5, 4, NA, 2, 4, NA, NA, 2, NA, NA, NA, 3, 2, 2, 2, 2, 2, 2, 3, NA, NA, 4, NA, 5, 3, 2, 1, ~
## $ hzcy050a   <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 4, 4, 5, NA, 2, 2, NA, NA, 1, NA, 2, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, NA, 2, 3, 1, 2, 3~
## $ hzcy051a   <dbl> NA, 2, 4, 2, NA, 3, 4, 1, NA, 1, 2, 3, NA, 2, 1, NA, NA, 1, NA, 3, NA, 4, 1, 2, 2, 2, 2, 3, 2, NA, NA, 3, NA, 2, 3, 1, 1, 3~
## $ hzcy052a   <dbl> NA, 2, 1, 2, NA, 1, 2, 1, NA, 1, 2, 2, NA, 1, 1, NA, NA, 1, NA, 1, NA, 2, 2, 2, 2, 2, 3, 3, 1, NA, NA, 4, NA, 2, 2, 2, 2, 2~
## $ mean_trust <dbl> NaN, 1.777778, 2.000000, 2.000000, NaN, 2.000000, 2.555556, 1.666667, NaN, 2.333333, 3.111111, 3.111111, NaN, 2.111111, 2.5~
```

---
class: center, middle

# [Exercise](XXX) time 🏋️♀️💪🏃🚴

## [Solutions](XXX)

---
class: middle

**Sometimes, things are a bit more complicated. Simple recoding is insufficient when we need to create new variables based on the values of existing variable(s). Such procedures can be called conditional transformations.**

---

## Simple conditional transformation

The simplest version of a conditional variable transformation is using an `ifelse()` statement.

```r
gp_covid <- 
  gp_covid %>% 
  mutate(
    high_education = 
      ifelse(education_cat == 3, "high", "not so high")
  )

gp_covid %>% 
  select(education_cat, high_education) %>% 
  sample_n(5)
```

```
## # A tibble: 5 x 2
##   education_cat high_education
##           <dbl> <chr>
## 1             3 high
## 2             3 high
## 3             2 not so high
## 4             3 high
## 5             3 high
```

.small[
*Note*: A more versatile option for creating dummy variables is the [`fastDummies` package](https://jacobkap.github.io/fastDummies/).
]

---

## Advanced conditional transformation

For more flexible (or complex) conditional transformations, the `case_when()` function from `dplyr` is a powerful tool.
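
A minimal sketch of how it works, using a made-up toy vector `orientation` (invented for illustration, not part of the survey data): each line pairs a condition with a value, the conditions are checked from top to bottom, and the first one that is `TRUE` determines the result. Elements for which no condition is `TRUE` (such as a missing value) end up as `NA`.

```r
library(dplyr)

# Toy values on a 0-10 scale, invented for illustration only
orientation <- c(1, 5, 9, NA)

leaning <-
  case_when(
    orientation <= 3 ~ "left",   # checked first
    orientation <= 7 ~ "center", # only reached if the first condition is not TRUE
    orientation > 7 ~ "right"    # no condition is TRUE for NA, so NA stays NA
  )

leaning
```

The same logic applied to our data:
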
```r
gp_covid <- 
  gp_covid %>% 
  mutate(
    pol_leaning_cat = 
      case_when(
        between(political_orientation, 0, 3) ~ "left",
        between(political_orientation, 4, 7) ~ "center",
        political_orientation > 7 ~ "right"
      )
  )

gp_covid %>% 
  select(political_orientation, pol_leaning_cat) %>% 
  sample_n(5)
```

```
## # A tibble: 5 x 2
##   political_orientation pol_leaning_cat
##                   <dbl> <chr>
## 1                     2 left
## 2                     3 left
## 3                     5 center
## 4                     8 right
## 5                     3 left
```

---

## Conditional transformation based on multiple values

```r
gp_covid <- 
  gp_covid %>% 
  mutate(
    pol_leaning_edu = 
      case_when(
        between(political_orientation, 0, 3) & high_education == "high" ~ "left high",
        between(political_orientation, 4, 7) & high_education == "high" ~ "center high",
        political_orientation > 7 & high_education == "high" ~ "right high",
        TRUE ~ "not so high"
      )
  )

gp_covid %>% 
  select(political_orientation, high_education, pol_leaning_edu) %>% 
  sample_n(5)
```

```
## # A tibble: 5 x 3
##   political_orientation high_education pol_leaning_edu
##                   <dbl> <chr>          <chr>
## 1                     6 not so high    not so high
## 2                     6 high           center high
## 3                     5 not so high    not so high
## 4                     5 high           center high
## 5                     3 high           left high
```

---

## `dplyr::case_when()`

A few things to note about `case_when()`:

- you can have multiple conditions per value
- conditions are evaluated consecutively
- when none of the specified conditions are met for an observation, by default, the new variable will have a missing value `NA` for that case
- if you want some other value in the new variables when the specified conditions are not met, you need to add `TRUE ~ value` as the last argument of the `case_when()` call
- to explore the full range of options for `case_when()` check out its [online documentation](https://dplyr.tidyverse.org/reference/case_when.html) or run `?case_when()` in `R`/*RStudio*

---

## `dplyr::case_when()`

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\dplyr_case_when.png" width="95%" style="display: block; margin: auto;" />

<small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small>

---
class: center, middle

# [Exercise](XXX) time 🏋️♀️💪🏃🚴

## [Solutions](XXX)

---

## Get a bit more programmy

So far, all of the previous tasks share two characteristics
- they are based on the structure of the whole dataset
- their output is again the whole dataset

Particularly in data analysis, our aim is often to extract information from a dataset (e.g., summary statistics, regression estimates). We will now learn a bit more about
- writing functions
- if-else statements
- for loops and the like
- modern `tidyverse` implementations

---

## Functional Programming: In `R`, everything's a function (more or less)

You might already be familiar with using functions in `R` (at least we have used them heavily on the previous slides). Functions are applied as shown here:

```r
fancy_function(data)
```

They can be nested, for example:

```r
log(sum(c(1, 2, 3)))
```

```
## [1] 1.791759
```

---

## Defining your own function is straightforward

First, let's create a simple function that adds `1` to an entered number.

```r
add_one <- function (a_number) {
  a_number + 1
}
```

Now, we can simply apply it to some data as with any other `R` function.
```r
add_one(2)
```

```
## [1] 3
```

```r
add_one(99)
```

```
## [1] 100
```

---

## Extend the sum function

```r
sum_na <- function (x) {
  sum(x, na.rm = TRUE)
}
```

---

## Feed it into `mutate()` and `c_across()`

```r
gp_covid <- 
  gp_covid %>% 
  rowwise() %>% 
  mutate(
    new_sum_trust = sum_na(c_across(hzcy044a:hzcy052a))
  ) %>% 
  ungroup()

gp_covid %>% 
  select(hzcy044a:hzcy052a, new_sum_trust) %>% 
  glimpse()
```

```
## Rows: 3,765
## Columns: 10
## $ hzcy044a      <dbl> NA, 1, 2, 2, NA, 2, 2, 2, NA, 1, 3, 1, NA, 2, 3, NA, NA, 1, NA, 2, NA, 1, 1, NA, 2, 1, 2, 1, 2, NA, NA, 1, NA, 2, 2, 2, ~
## $ hzcy045a      <dbl> NA, 2, 2, 2, NA, 1, 2, 2, NA, 2, 2, 3, NA, 3, 3, NA, NA, 1, NA, 4, NA, 2, 1, NA, 2, NA, NA, 3, 2, NA, NA, 1, NA, 2, NA, ~
## $ hzcy046a      <dbl> NA, 2, 1, 2, NA, 2, 2, 2, NA, 3, 4, 4, NA, 3, 3, NA, NA, 1, NA, 4, NA, 3, 2, NA, 2, 2, 3, 3, 2, NA, NA, 3, NA, 2, 4, 3, ~
## $ hzcy047a      <dbl> NA, 1, 1, 2, NA, 1, 2, 1, NA, 2, 2, 1, NA, 2, 2, NA, NA, 2, NA, 1, NA, 1, 1, 2, 2, 2, 2, 2, 1, NA, NA, 3, NA, 2, 2, 1, 1~
## $ hzcy048a      <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 3, 4, 5, NA, 2, 4, NA, NA, 2, NA, 1, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 4, NA, 3, 3, 2, 1~
## $ hzcy049a      <dbl> NA, 2, 3, 2, NA, 4, 3, 2, NA, 4, 5, 4, NA, 2, 4, NA, NA, 2, NA, NA, NA, 3, 2, 2, 2, 2, 2, 2, 3, NA, NA, 4, NA, 5, 3, 2, ~
## $ hzcy050a      <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 4, 4, 5, NA, 2, 2, NA, NA, 1, NA, 2, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, NA, 2, 3, 1, 2~
## $ hzcy051a      <dbl> NA, 2, 4, 2, NA, 3, 4, 1, NA, 1, 2, 3, NA, 2, 1, NA, NA, 1, NA, 3, NA, 4, 1, 2, 2, 2, 2, 3, 2, NA, NA, 3, NA, 2, 3, 1, 1~
## $ hzcy052a      <dbl> NA, 2, 1, 2, NA, 1, 2, 1, NA, 1, 2, 2, NA, 1, 1, NA, NA, 1, NA, 1, NA, 2, 2, 2, 2, 2, 3, 3, 1, NA, NA, 4, NA, 2, 2, 2, 2~
## $ new_sum_trust <dbl> 0, 16, 18, 18, 0, 18, 23, 15, 0, 21, 28, 28, 0, 19, 23, 0, 0, 12, 0, 18, 0, 22, 14, 12, 18, 15, 18, 21, 17, 0, 0, 25, 0,~
```

---

## if-else architecture in `R`

Using if-else statements in `R` requires at least 3 steps:

1. Start the statement with `if()`
2. Add the condition to be tested in the parentheses of the `if(condition)`
3. Write a function or procedure on data in the curly brackets of the `if(condition){ ... }`

For example:

```r
if (1 < 2) {
  1 + 2
}
```

```
## [1] 3
```

---

## Adding else statements

In a fourth step, we can add an `else { ... }`

```r
if (1 > 2) {
  1 + 2
} else {
  2 + 5
}
```

```
## [1] 7
```

So, the general architecture is like this:

```r
if (condition) {
  function_to_apply(data)
} else {
  other_function_to_apply(data)
}
```

(We could also test for another condition in the else statements with `else if()`)

---

## Adding it to our function!

```r
descriptives_na <- function(x, statistic) {
  if (statistic == "sum") {
    sum(x, na.rm = TRUE)
  } else if (statistic == "mean") {
    mean(x, na.rm = TRUE)
  } else {
    stop("no valid statistic provided!")
  }
}
```

---

## Trying it out

```r
descriptives_na(c(1, 2), statistic = "sum")
```

```
## [1] 3
```

```r
descriptives_na(c(1, 2), statistic = "mean")
```

```
## [1] 1.5
```

```r
descriptives_na(c(1, 2), statistic = "mode")
```

```
## Error in descriptives_na(c(1, 2), statistic = "mode"): no valid statistic provided!
```

---
class: center, middle

# [Exercise](XXX) time 🏋️♀️💪🏃🚴

## [Solutions](XXX)

---

## `for()` loops

(Simple) loops using the `for()` function are some of the most useful tools in programming. They enable, e.g., iterating through input data and applying functions to each element of the data
- what counts as an element depends on the specific purpose
- the elements can be rows, columns, list elements, etc.
- hence, it is crucial to think about the iterator of the specific call

---

## Architecture of for-loops

```r
for (iterator_name in data) {
  function_to_apply(iterator_name)
}
```

---

## Calculating means of all trust variables

```r
variables_vector <-
  c(
    "hzcy044a", "hzcy044a", "hzcy044a",
    "hzcy047a", "hzcy048a", "hzcy049a",
    "hzcy050a", "hzcy051a", "hzcy052a"
  )

for (variable in variables_vector) {
  print(
    descriptives_na(
      gp_covid[[variable]],
      statistic = "mean"
    )
  )
}
```

```
## [1] 1.84894
## [1] 1.84894
## [1] 1.84894
## [1] 1.558643
## [1] 2.336631
## [1] 2.427157
## [1] 2.178297
## [1] 2.032227
## [1] 1.761184
```

---

## The apply family

The apply family aims to make your life a bit easier when writing `base R` loops.

- provides a friendly interface to enter your data
- data come out in a standard format
- it may be faster than, e.g., writing a `for()` loop

However, we won't cover all functions of this precious family

- **`apply()`**
- **`lapply()`**
- **`sapply()`**
- **`tapply()`**
- `mapply()`, `rapply()`, & `vapply()` are left out

---

## apply()

The `apply()` function is useful when you want to fire up a short command across either all columns (option `MARGIN = 2`) _or_ rows (option `MARGIN = 1`).

```r
# means across columns/variables
apply(gp_covid[,20:24], 2, function (x) descriptives_na(x, statistic = "mean"))
```

```
##   hzcy007a   hzcy008a   hzcy009a   hzcy010a   hzcy011a 
## 0.80288763 0.46547395 0.01851852 0.08600126 0.91054614
```

```r
# means across rows/observations
apply(gp_covid[1:10,20:24], 1, function (x) descriptives_na(x, statistic = "mean"))
```

```
##  [1] NaN 0.2 0.4 0.4 NaN 0.0 0.4 0.4 NaN 0.6
```

While there are plenty of functions for building descriptive tables already out there (e.g., `psych::describe()`), `apply()` comes in handy when you want to create them yourself.

---

## lapply()

`lapply()` is for more elaborate operations.
However, there are no `MARGIN` options, so let's see what happens when we use it in a similar way as before:

```r
lapply(gp_covid[,20:24], function (x) descriptives_na(x, statistic = "mean"))
```

```
## $hzcy007a
## [1] 0.8028876
## 
## $hzcy008a
## [1] 0.4654739
## 
## $hzcy009a
## [1] 0.01851852
## 
## $hzcy010a
## [1] 0.08600126
## 
## $hzcy011a
## [1] 0.9105461
```

---

## lapply() returns lists

It might be a little bit uncomfortable, but `lapply()` returns each result of an iterated operation as a list element. Thus, the output of applying the function is a list.

Some people don't like lists as they separate information from each other.

I like lists.

---

## sapply()

`sapply()` is similar to `lapply()`. The minor but significant difference is that it returns a vector instead of a list whenever the results can be simplified. When you want to add the results of this function as a new column to your existing data, this comes in handy.

```r
sapply(gp_covid[,20:24], function (x) descriptives_na(x, statistic = "mean"))
```

```
##   hzcy007a   hzcy008a   hzcy009a   hzcy010a   hzcy011a 
## 0.80288763 0.46547395 0.01851852 0.08600126 0.91054614
```

---

## tapply()

Finally, `tapply()` is useful when you want to perform an action across different groups in your data.
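
A minimal sketch with toy data (the vectors `score` and `group` are invented for illustration, not part of the survey data): `tapply()` splits the first vector by the values of the second one and applies the function to each piece.

```r
# Toy data, invented for illustration only
score <- c(4, 6, 5, 7, 9)
group <- c("a", "a", "b", "b", "b")

# Split score by group and apply mean() to each piece
group_means <- tapply(score, group, mean)

group_means
```

With our survey data, the call looks like this:
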
```r
tapply(
  gp_covid$political_orientation, 
  gp_covid$sex, 
  function (x) descriptives_na(x, statistic = "mean")
)
```

```
##        1        2 
## 4.893819 4.412325
```

---

## Modern stuff from the `tidyverse`'s `purrr` package

.pull-left[
Thus far, our examples have not been that complicated
- we had one specific task to perform and the input data were not complex
]

.pull-right[
Sometimes, things are a bit more complicated
- the data have to be wrangled before the actual loop
]

.pull-left[
<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\purrr_logo.png" width="160" style="display: block; margin: auto;" />
]

.pull-right[
**`purrr` provides a collection of functions that also integrate nicely into a `%>%` workflow.**
]

---

## A simple `map()` example

```r
library(purrr)

gp_covid %>% 
  select(sex, hzcy044a:hzcy052a) %>% 
  group_split(sex, .keep = FALSE) %>% # one data frame per value of sex
  map(~as.matrix(.x)) %>% 
  map_dbl(~descriptives_na(.x, statistic = "mean"))
```

```
## [1] 2.147828 2.046260
```

---

## `purrr::map()`

A few things to note about `map()`:

- `map()` usually expects a list as input
  - this is why we split our data into a list of two data frames
- a function is applied to each list element with a preceding `~` operator
- by default, `map()` also returns the results as a list
  - yet, there are pre-defined `map()`-flavors that return other data types (e.g., the `map_dbl()` used above)
- you may want to have a look at the help page using `?map` for a comprehensive overview

**We will re-use the `purrr` capabilities later this week when we wrangle multiple regression models at the same time.**

---

## `purrr::map()`

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\map_frosting.png" width="95%" style="display: block; margin: auto;" />

<small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small>

---
class: center, middle

# [Exercise](XXX) time 🏋️♀️💪🏃🚴

## [Solutions](XXX)

---

# Extracurricular activities

`R` can also be used for creating text-based adventure games. Play the fun short text adventure ["Castle of R"](https://github.com/gsimchoni/CastleOfR), which was designed to test your programming skills using `base R`. Also check out the [background](http://giorasimchoni.com/2017/09/10/2017-09-10-you-re-in-a-room-the-castleofr-package/) of the programming of the game/package.
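
---

## Addendum: `rowSums()` and `rowMeans()`

Coming back to the third note on aggregate variables: here is a minimal sketch of the base `R` alternative to `rowwise()`, using a made-up toy data frame `items` (its name, columns, and values are invented for illustration and are not part of the survey data).

```r
# Toy items, invented for illustration only
items <- data.frame(
  item_1 = c(1, 2, NA),
  item_2 = c(3, NA, NA),
  item_3 = c(5, 4, NA)
)

item_names <- c("item_1", "item_2", "item_3")

# Row-wise sum and mean, ignoring missing values
items$sum_score <- rowSums(items[, item_names], na.rm = TRUE)
items$mean_score <- rowMeans(items[, item_names], na.rm = TRUE)

items
```

As in the `rowwise()` versions, a row that contains only missing values gets a sum of `0` and a mean of `NaN`. In a `dplyr` pipeline, the analogous call should look something like `mutate(sum_trust = rowSums(across(hzcy044a:hzcy052a), na.rm = TRUE))`.
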